We first need to load up the work that was previously done in deliverable 1.
In this deliverable I am going to scrape NBA game data from the web and use it to try and predict the outcomes of games.
The first thing that we need to do is scrape the data from the web.
We also want to make sure that the data we just scraped is in the correct type.
tib$home_pts <- as.numeric(tib$home_pts)
tib$visitor_pts <- as.numeric(tib$visitor_pts)
tib$attendance <- as.numeric(gsub(",", "",
tib$attendance))
tib$date_game <- mdy(tib$date_game)
After obtaining all of the data, we now want to organize it in a more condensed table. We are going to create a new table containing the date, home/away teams, their respective score, overtime, attendance and start time.
schedule <- tibble(date=tib$date_game,
home_team=tib$home_team_name,
home_score=tib$home_pts,
away_team=tib$visitor_team_name,
away_score=tib$visitor_pts,
overtimes=tib$overtimes,
attendance=tib$attendance,
start_time=tib$game_start_time)
schedule
In order to make observations easier, I am going to add each teams’ abbreviation to the schedule table. To do this, I first need to create a table with each name and abbreviation matched, which I can do using the ranking table.
ranking <- arrange(ranking,year)
abbrevs <- tibble(abbrev=ranking$team_abbrev,home_team=ranking$team_name,num = 1:nrow(ranking))
abbrevs$home_team<- as.factor(abbrevs$home_team)
name_levels <- levels(abbrevs$home_team)
levels(abbrevs$home_team)[abbrevs$home_team=="Seattle Supersonics"] <- "Seattle SuperSonics"
name_levels <- levels(abbrevs$home_team)
for(t in name_levels){
first_occur <- max(abbrevs$num[abbrevs$home_team==t])
abbrevs <- abbrevs %>% filter(!(home_team==t & num !=first_occur))
}
home_abbrev <- tibble(home_abbrev=abbrevs$abbrev,home_team=abbrevs$home_team)
away_abbrev <- tibble(away_abbrev=abbrevs$abbrev,away_team=abbrevs$home_team)
new_schedule <- schedule %>%
left_join(home_abbrev, by="home_team")
new_schedule <- new_schedule %>%
left_join(away_abbrev, by="away_team")
new_schedule
Due to almost half of the data not having values for overtimes, attendance and start time, I am going to remove those from the new_schedule table.
ind_game <- tibble(year=year(new_schedule$date),
home_abbrev=new_schedule$home_abbrev,
home_team=new_schedule$home_team,
home_score=new_schedule$home_score,
away_abbrev=new_schedule$away_abbrev,
away_team=new_schedule$away_team,
away_score=new_schedule$away_score
)
ind_game
For the initial model, I am going to try and predict total game score for the home team using the home team’s yearly made/attempted free throws, made/attempted shots,rebounds, assists, fouls, and points.
First though, we must add the yearly statistics to the ind_game table.
ind_game$year <- as.factor(ind_game$year)
new_game <- ind_game %>%
left_join(statistics, by=c("home_abbrev"="team_abbrev","year"))
## Warning: Column `year` joining factors with different levels, coercing to
## character vector
new_game
simple_model <- lm(new_game, formula= home_score ~ made_field_goal+
attempt_field_goal+made_free_throw+
attempt_free_throw+rebounds+
assists+fouls+points_scored)
summary(simple_model)
##
## Call:
## lm(formula = home_score ~ made_field_goal + attempt_field_goal +
## made_free_throw + attempt_free_throw + rebounds + assists +
## fouls + points_scored, data = new_game)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.298 -8.358 -0.356 7.909 67.869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.2252431 0.7373503 93.884 <2e-16 ***
## made_field_goal 0.0323311 0.0007468 43.294 <2e-16 ***
## attempt_field_goal -0.0050526 0.0002483 -20.347 <2e-16 ***
## made_free_throw 0.0143980 0.0009137 15.758 <2e-16 ***
## attempt_free_throw -0.0007963 0.0006837 -1.165 0.244
## rebounds 0.0017601 0.0001684 10.454 <2e-16 ***
## assists -0.0062197 0.0003902 -15.941 <2e-16 ***
## fouls -0.0098661 0.0003759 -26.250 <2e-16 ***
## points_scored -0.0037341 0.0002825 -13.217 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.29 on 50082 degrees of freedom
## (2954 observations deleted due to missingness)
## Multiple R-squared: 0.2802, Adjusted R-squared: 0.2801
## F-statistic: 2437 on 8 and 50082 DF, p-value: < 2.2e-16
This model should make sense because as offensive statistics go up, a team should score more points on a nightly basis. Based on the summary of the model, it seems as though all of the variables provided were good predictors of score except attempted free throws. This seems a little weird because logically, the more free throws taken the more points you should score. One thing that may be influencing the output is that there is some missing data in the form of zeros for some of the earlier years when that stat was not kept track of. This model also doesn’t take into account the the opposing team and it’s defensive/offensive stats. That is something that I think can be looked at in the future when making a more accurate model.